A Suffix Tree Approach to Text Categorisation Applied to Spam Filtering

Authors

  • Rajesh M. Pampapathi
  • Boris Mirkin
  • Mark Levene
Abstract

We present an approach to textual classification based on the suffix tree data structure and apply it to spam filtering. A method for scoring documents using the suffix tree is developed, and a number of scoring and score normalisation functions are tested. Our results show that the character level representation of documents and classes facilitated by the suffix tree significantly improves classification accuracy when compared with the currently popular naive Bayesian filtering method.
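The idea of scoring a document against a per-class character-level tree can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: it uses a depth-limited trie of suffixes instead of a true suffix tree, and the names `build_trie`, `score`, `classify`, the `max_depth` parameter, and the relative-frequency scoring rule are hypothetical, not the scoring or normalisation functions tested in the paper.

```python
class Node:
    """One trie node: child links plus how often this substring was seen."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(texts, max_depth=4):
    """Insert every suffix of every text, truncated to max_depth characters."""
    root = Node()
    for text in texts:
        for start in range(len(text)):
            node = root
            for ch in text[start:start + max_depth]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def score(root, text, max_depth=4):
    """Hypothetical score: walk each suffix of `text` down the trie and add
    the relative frequency of every matched character among its siblings.
    The sum is divided by the text length as a simple normalisation."""
    total = 0.0
    for start in range(len(text)):
        node = root
        for ch in text[start:start + max_depth]:
            child = node.children.get(ch)
            if child is None:
                break  # this suffix no longer matches the class model
            sibling_total = sum(c.count for c in node.children.values())
            total += child.count / sibling_total
            node = child
    return total / max(len(text), 1)


def classify(text, spam_trie, ham_trie):
    """Label the text with the class whose trie gives the higher score."""
    if score(spam_trie, text) > score(ham_trie, text):
        return "spam"
    return "ham"


# Toy training data, invented for this example only.
spam_trie = build_trie(["buy cheap pills now", "cheap pills online"])
ham_trie = build_trie(["meeting at noon tomorrow", "see you at the meeting"])
print(classify("cheap pills", spam_trie, ham_trie))
print(classify("meeting tomorrow", spam_trie, ham_trie))
```

Because matching is done on characters and not on whole words, a sketch like this still finds partial overlaps when words are misspelled or split, which is the kind of property the abstract attributes to the character level representation.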


Similar resources

Annotated suffix trees for text modelling and classification

Suffix trees are compact and versatile data structures in which paths from the root to nodes represent substrings of the encoded text. By annotating such a tree with the frequencies of substrings, it is possible to construct a compact model of text that captures its sequential nature. This thesis investigates the use of such a model in the representation and classification of text. The basic ap...


Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

In recent years anti-spam filters have become necessary tools for Internet service providers to face up to the continuously growing spam phenomenon. Current server-side anti-spam filters are made up of several modules aimed at detecting different features of spam e-mails. In particular, text categorisation techniques have been investigated by researchers for the design of modules for the analys...


Prediction of Fault-Prone Software Modules Using a Generic Text Discriminator

This paper describes a novel approach for detecting fault-prone modules using a spam filtering technique. Fault-prone module detection in source code is important for the assurance of software quality. Most previous fault-prone detection approaches have been based on using software metrics. Such approaches, however, have difficulties in collecting the metrics and constructing mathematical models...


A Suffix Tree Approach to Email Filtering

We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the character level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy when compared ...


Email Classification Using Machine Learning Algorithms

Email has become one of the frequently used forms of communication. Everyone has at least one email account. Inflow of spam messages is a major problem faced by email users. Currently there are many spam filtering techniques. As the spam filtering techniques came up, spammers improved their methods of spamming. Thus, an effective spam filtering technique is the timely requirement. In this paper...




Publication date: 2008